Correlations of record events as a test for heavy-tailed distributions 
A record is an entry in a time series that is larger or smaller than all previous entries. If the time 
series consists of independent, identically distributed random variables with a superimposed linear 
trend, record events are positively (negatively) correlated when the tail of the distribution is heavier 
(lighter) than exponential. Here we use these correlations to detect heavy-tailed behavior in small 
sets of independent random variables. The method consists of converting random subsets of the 
data into time series with a tunable linear drift and computing the resulting record correlations. 



Determining the probability distribution underlying a 
given data set or at least its behavior for large argument 
is of pivotal importance for predicting the behavior of 
the system: If the data is drawn from a distribution with 
heavy tails, one needs to prepare for large events. Of par- 
ticular relevance is the case when the probability density 
displays a power law decay, as this implies a drastic en- 
hancement of the probability of extreme events. This is 
one of the reasons for the persistent interest in the ob- 
servation and modeling of power law distributions, which 
have been associated with critical, scale-invariant behav-
ior [1, 2] in diverse contexts ranging from complex net-
works [3] to paleontology [4], foraging behavior of animals
[5], citation distributions [6] and many more [7].

However, when trying to infer the tail behavior of the 
underlying distribution from a finite data set, one faces 
the problem that the number of entries of large absolute 
value is very small. This implies that even though bin- 
ning the entries by magnitude and plotting them would 
yield an approximate representation of the probability 
density, this process becomes inconclusive in particular 
in the tail of the probability density. Furthermore, in 
small data sets, extreme outliers can strongly affect the 
results of methods like maximum likelihood estimators 
such that leaving out even one of these extreme and pos- 
sibly spurious data points renders the outcome of the test 
insignificant. A case in point is the problem of estimating 
the distribution of fitness effects of beneficial mutations 
in evolution experiments, which are expected on theoreti-
cal grounds to conform to one of the universality classes of
extreme value theory (EVT) [8]. Because beneficial mu-
tations are rare, the corresponding data sets are typically
limited to a few dozen values, and the determination of
the tail behavior can be very challenging [9, 10].

In this Letter we present a method for detecting heavy 
tails in empirical data that works reliably for small data 
sets (on the order of a few dozen entries) and is robust 
with respect to removal of extreme entries. The test is 
based on the statistics of records of subsamples of the 
data set. Similar to conventional record-based statisti-
cal tests [11-13], and in contrast to the bulk of methods
available in this field [7], our approach is non-parametric
and, hence, does not require any hypothesis about the un-
derlying distribution. Rather than aiming at reliable es-
timates of the parameters of the distribution (such as the
power law exponent), the main purpose of our method is 
to distinguish between distributions that are heavy-tailed 
and those that are not. 

Record statistics and record correlations. Given a time
series $\{x_1, \ldots, x_n\}$ of random variables (RVs), the $n$th
RV is said to be a record if it exceeds all previous RVs
$\{x_j\}_{j<n}$ [11-13]. For independent, identically distributed
(i.i.d.) RVs, it is straightforward to see that the prob-
ability $p_n$ for the $n$th entry to be a record is simply
$p_n = 1/n$, because any of the $n$ RVs is equally likely to
be the largest. Furthermore, record events are stochasti-
cally independent in this case [12], and hence the joint
probability $p_{n,n-1}$ that both $x_{n-1}$ and $x_n$ are records fac-
torizes to $p_{n,n-1} = p_n p_{n-1}$.
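The two i.i.d. statements above can be checked by brute force. The following is a minimal sketch, not part of the original analysis: it assumes continuous RVs (no ties), so that only the rank order matters and all $n!$ orderings are equally likely. The function name `record_probabilities` is introduced here purely for illustration.

```python
from fractions import Fraction
from itertools import permutations


def record_probabilities(n):
    """Exact (p_n, p_{n-1}, p_{n,n-1}) for n >= 2 i.i.d. continuous RVs.

    Assumption: no ties, so every rank ordering of the n entries is
    equally likely and plain enumeration gives exact probabilities.
    """
    hits_last = hits_prev = hits_both = total = 0
    for ranks in permutations(range(n)):
        last_is_record = ranks[-1] == max(ranks)       # n-th entry is a record
        prev_is_record = ranks[-2] == max(ranks[:-1])  # (n-1)-th entry is a record
        hits_last += last_is_record
        hits_prev += prev_is_record
        hits_both += last_is_record and prev_is_record
        total += 1
    return (Fraction(hits_last, total),
            Fraction(hits_prev, total),
            Fraction(hits_both, total))


p_n, p_prev, p_both = record_probabilities(5)
print(p_n, p_prev, p_both)     # 1/5 1/4 1/20
print(p_both == p_n * p_prev)  # True: the joint probability factorizes
```

Enumeration is only feasible for small $n$, but it makes the factorization explicit without any sampling noise.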

In a recent surge of interest IB- 13], record statistics 
has been explored beyond the classical situation of i.i.d. 
RVs, and it has been found that the stochastic indepen- 
dence of record events is largely restricted to the i.i.d. 
case. In particular, for time series constructed from the 



linear drift model 3, 23| 



$$x_n = c\,n + \eta_n \tag{1}$$



where $c > 0$ is a constant and $\{\eta_n\}$ a family of i.i.d.
RVs with distribution $F(\eta)$ and density $f(\eta)$, correlations
between record events were quantified by considering the
ratio [21]



$$l_{n,n-1}(c) = \frac{p_{n,n-1}(c)}{p_n(c)\,p_{n-1}(c)}. \tag{2}$$



For stochastically independent record events, $l_{n,n-1}(c) = 1$,
and any positive (negative) deviation from unity can be
interpreted as a sign of attraction (repulsion) between
record events. In [21] both cases were found depending
on the distribution $F(\eta)$. Specifically, an expansion to
first order in $c$ yields $l_{n,n-1}(c) = 1 + cJ(n) + O(c^2)$ with

$$J(n) \approx -\frac{n^4}{2}\bigl(I(n) - I(n-1)\bigr) - n^3 I(n), \quad \text{where} \quad I(n) = \int d\eta\, f^2(\eta)\, F^{n}(\eta), \tag{3}$$

and clearly $I(n) - I(n-1) < 0$. Thus for large $n$, there
are two competing contributions to $J(n)$ determining the
sign of the correlations.

To classify the behavior of the correlations in terms of
the EVT classes [H , consider the generalized Pareto 
distribution [24] $f(\eta) = (1 + \kappa\eta)^{-(\kappa+1)/\kappa}$, which re-
produces the three classes as $\kappa < 0$ (Weibull), $\kappa > 0$
(Fréchet) and $\kappa = 0$ (Gumbel), respectively. Computing
$I(n)$ separately for these three cases [21] it was shown
that, up to multiplicative terms in $\log(n)$ or slower, one
has $I(n) \sim n^{-(\kappa+2)}$ and therefore [21] $J(n) \approx \frac{\kappa}{2} n^3 I(n)$,
showing that the sign of correlations is directly deter-
mined by the extreme value index $\kappa$ [25].
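
As a consistency check on this scaling, the integral can be evaluated in closed form for the generalized Pareto family. The sketch below is a worked example rather than part of the original derivation. It assumes the large-$n$ forms $I(n) = \int d\eta\, f^2(\eta) F^n(\eta)$ and $J(n) \approx -\frac{n^4}{2}(I(n) - I(n-1)) - n^3 I(n)$, where the exact exponent conventions are an assumption. It uses the substitution $u = F(\eta)$, and it requires $\kappa > -2$ for the integral to converge.

```latex
\begin{align*}
F(\eta) &= 1 - (1+\kappa\eta)^{-1/\kappa}
  \quad\Rightarrow\quad f(\eta) = \bigl(1 - F(\eta)\bigr)^{\kappa+1}, \\
I(n) &= \int d\eta\, f^2(\eta)\, F^n(\eta)
      = \int_0^1 du\, (1-u)^{\kappa+1} u^n
      = \frac{\Gamma(n+1)\,\Gamma(\kappa+2)}{\Gamma(n+\kappa+3)}
      \sim \Gamma(\kappa+2)\, n^{-(\kappa+2)}, \\
\frac{I(n-1)}{I(n)} &= \frac{n+\kappa+2}{n}
  \quad\Rightarrow\quad I(n) - I(n-1) = -\frac{\kappa+2}{n}\, I(n), \\
J(n) &\approx \frac{n^4}{2}\,\frac{\kappa+2}{n}\, I(n) - n^3 I(n)
      = \frac{\kappa}{2}\, n^3 I(n).
\end{align*}
```

The two competing contributions are both of order $n^3 I(n)$ and differ only in their prefactors, $(\kappa+2)/2$ versus $1$, so the sign of $J(n)$ is the sign of $\kappa$. For $\kappa = 0$ they cancel at this order, which is why the Gumbel class needs a more refined treatment.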

In the Gumbel class ($\kappa = 0$) more refined calcu-
lations for the generalized Gaussian densities $f_\beta(\eta) \sim
\exp(-|\eta|^\beta)$ show that correlations are negative for $\beta > 1$
and positive for $\beta < 1$ [21]. The marginal case of a
pure exponential distribution also shows positive corre-
lations, but they can be distinguished from the $\beta < 1$
case in magnitude and, more clearly, in their $n$ depen-
dence: While for $\beta < 1$, correlations grow with $n$ up to a
limiting value, for $\beta = 1$ they are independent of $n$. The
special, marginal role of the exponential distribution was
also encountered in a study of near-extreme events [23],
where the integral (3) appears in a different context.

To sum up, correlations between record events in time 
series with a linear drift allow a clear distinction between 
underlying probability densities that decay like an expo- 
nential or faster for large argument, and densities with 
heavier tails, by looking for positive correlations that 
grow in n. Using these two criteria, we now present a 
distribution-free test for heavy tails in data sets of i.i.d. 
random variables. 

Description of the test. Consider a data set with $N$
entries, $x_1, x_2, \ldots, x_N$, that can reasonably be argued to
consist of independent samples from the same distribu-
tion [26]. Then for each $n < N$, one can pick uniformly
at random a subset of $n$ entries and add a linear trend
according to the index in the subset (see Eq. (1)), thus
forming a set of random variables with linear trend. For
each $n$, there are $\binom{N}{n}$ possible subsets [27], which can
be used to compute the fraction of times the $n$th en-
try is a record, $\hat{p}_n(c)$, the corresponding fraction $\hat{p}_{n-1}(c)$
for the $(n-1)$th entry, and the fraction $\hat{p}_{n,n-1}(c)$ of times
both entries are records, for each value of a suitably cho-
sen range of $c$ [28]. The number $s$ of subsets used for
each value of $c$ will be referred to as 'internal statis-
tics'. Finally, one obtains an estimate for the correlations
$\hat{l}_{n,n-1}(c) = \hat{p}_{n,n-1}(c) / \bigl(\hat{p}_n(c)\,\hat{p}_{n-1}(c)\bigr)$, where the hat serves to indicate
that we are dealing with one fixed time series of length
$N$ and its sub-series, rather than many independent re-
alizations. In the following we refer to $\hat{l}_{n,n-1}(c)$ as the
heavy tail indicator (HTI).
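
The procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `last_two_records` and `heavy_tail_indicator` are hypothetical, subsets are drawn with the standard library and kept in their original data order, the trend is $c\,k$ for the $k$th entry of the subset, and the estimate is returned as `nan` if one of the single-record fractions is zero.

```python
import random


def last_two_records(values, c):
    """Return (is (n-1)-th a record, is n-th a record) for y_k = values[k] + c*k."""
    drifted = [x + c * k for k, x in enumerate(values, start=1)]
    # The first entry of a series counts as a record, hence the -inf default.
    prev_is_record = drifted[-2] > max(drifted[:-2], default=float("-inf"))
    last_is_record = drifted[-1] > max(drifted[:-1])
    return prev_is_record, last_is_record


def heavy_tail_indicator(data, n, c, s, rng=None):
    """Estimate the HTI from s random size-n subsets of data (requires 2 <= n <= len(data))."""
    rng = rng or random.Random()
    hits_prev = hits_last = hits_both = 0
    for _ in range(s):
        # Assumption: the subset keeps the original ordering of the data.
        idx = sorted(rng.sample(range(len(data)), n))
        prev_rec, last_rec = last_two_records([data[i] for i in idx], c)
        hits_prev += prev_rec
        hits_last += last_rec
        hits_both += prev_rec and last_rec
    if hits_prev == 0 or hits_last == 0:
        return float("nan")
    return (hits_both / s) / ((hits_last / s) * (hits_prev / s))


if __name__ == "__main__":
    rng = random.Random(1)
    gauss = [rng.gauss(0.0, 1.0) for _ in range(64)]
    pareto = [rng.paretovariate(2.0) for _ in range(64)]
    for name, data in (("Gaussian", gauss), ("Pareto", pareto)):
        print(name, heavy_tail_indicator(data, n=16, c=0.25, s=2000, rng=rng))
```

The values printed for a single data set fluctuate from sample to sample, so one run should not be read as a verdict. In the limit of very large $c$ every entry is a record and the estimate is exactly 1, which is a convenient sanity check.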

To see how the test works in practice, consider Fig. 1.
Two data sets of size $N = 64$ each are presented, one
drawn from a standard Gaussian distribution, the other
from a symmetric Lévy stable distribution with parame-
ter $\mu = 1.3$. A standard approach to inferring the shape
of the distribution is to estimate the cumulative distribu-
tion function by rank ordering the data along the x-axis
(inset). In the example this shows that one distribution is
broader than the other, but does not allow one to distinguish
between a difference in scale (as for two Gaussians of dif-
ferent standard deviation) and a difference in shape. In
contrast, the two data sets come apart quite clearly un-
der application of the test, showing that $\hat{l}_{n,n-1}(c) > 1$ for
the Lévy distribution and $\hat{l}_{n,n-1}(c) < 1$ for the Gaussian
(main figure).

Figure 1. A first example of the proposed test, with $N = 64$
i.i.d. RVs drawn from a Gaussian with unit variance (squares)
and a symmetric Lévy distribution with $\mu = 1.3$ (cir-
cles). Inset: Comparing the cumulative distribution function
$F(x)$ (lines) to its empirical estimate from the 64 data points
shows that one distribution is broader than the other but
does not allow for a clear distinction between the two data
sets. Main plot: This difference is however clearly seen un-
der application of the record-based test for subsamples of size
$n = 16$. Dotted and dashed-dotted lines show the prediction
for $l_{16,15}(c)$ for independent RVs.
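
For comparison, the rank-ordering estimate of the cumulative distribution function mentioned above takes only a few lines. This is a generic sketch with a hypothetical helper name, `empirical_cdf`. It assumes the common convention of assigning the value $k/N$ to the $k$th smallest data point.

```python
def empirical_cdf(data):
    """Pair the k-th smallest value with k/N (rank-ordering estimate of F)."""
    ordered = sorted(data)
    size = len(ordered)
    return [(x, (rank + 1) / size) for rank, x in enumerate(ordered)]


print(empirical_cdf([3.0, 1.0, 2.0, 4.0]))
# [(1.0, 0.25), (2.0, 0.5), (3.0, 0.75), (4.0, 1.0)]
```

Rescaling all data points by a constant stretches the horizontal axis of this estimate without changing its steps, which is why a difference in scale and a difference in shape are hard to tell apart from such a plot alone.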

Fluctuations. The lines in the main part of Fig. 1 show
the predicted correlation $l_{n,n-1}(c)$ obtained from simula-
tions of independent RVs. The estimated HTI $\hat{l}_{n,n-1}(c)$
obtained from subsamples of the two finite data sets de-
viates from these predictions, reflecting the fact that the
ensemble of subsamples is not independent. The devi-
ations depend on the data set in a random way (com-
pare Fig. 4), and understanding how the magnitude
of the deviations depends on the test parameters $N$, $n$
and $s$ is clearly important for a quantitative assessment
of the significance of the test. Figure 2 explores these
sample-to-sample fluctuations by computing $\hat{l}_{n,n-1}(c)$ for
a large number $S$ ('external statistics') of different data
sets and recording the mean and the mean squared devia-
tion for different distributions. The fluctuations are large
for power law distributions and decrease significantly for
representatives of the Gumbel and Weibull classes. The
latter implies that it is very unlikely for positive corre-
lations to be produced by chance if the underlying dis-
tribution is not of heavy tail type; the observation of an
HTI exceeding unity can therefore be taken as a strong
indication of heavy-tailed behavior in the data.

Figure 2. Sample-to-sample fluctuations of the HTI $\hat{l}_{n,n-1}$ for
different distributions. Lines show mean values of the correla-
tion $l_{16,15}(c)$ obtained from simulations of independent RVs
(labeled exact), symbols show the mean value of the HTI
and error bars indicate the standard deviation of the fluctu-
ations for the symmetric Lévy-stable distribution with tail
parameter $\mu = 1.3$ and uniform distribution on (0,1) (top),
and the Pareto distribution with $\mu = 2.0$ and standard normal
distribution (bottom). The HTI was obtained from simula-
tions with internal statistics s = 10* (Pareto) or s = 10^ (all
other) and averaged over S = 10^ independent data sets. In-
sets show how the correlations at the value $c^* = 0.25$, where
correlations deviate maximally from unity, grow as a function
of $n$ while keeping $N$ fixed.
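
The external averaging over $S$ data sets can be sketched generically. The helper below is hypothetical and uses only the Python standard library. Here `statistic` stands for any function that maps one data set to a number (for instance an HTI estimator evaluated at fixed $n$, $c$ and $s$), and `draw_dataset` for any function that generates one data set from a random number generator. Both names are assumptions made for illustration.

```python
import random
import statistics


def sample_to_sample(statistic, draw_dataset, S, seed=0):
    """Mean and standard deviation of `statistic` over S independent data sets."""
    rng = random.Random(seed)
    values = [statistic(draw_dataset(rng)) for _ in range(S)]
    # pstdev is the square root of the mean squared deviation from the mean.
    return statistics.mean(values), statistics.pstdev(values)


def draw_gaussian(rng, size=64):
    """One data set of i.i.d. standard normal entries (illustrative choice)."""
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


# Stand-in statistic: the sample maximum, used here only to exercise the helper.
mean_value, spread = sample_to_sample(max, draw_gaussian, S=200)
print(mean_value, spread)
```

Swapping the stand-in statistic for an HTI estimator gives the kind of mean and error bar shown in Fig. 2, at a cost proportional to $S \times s$ subset evaluations.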

The effect of the internal statistics on the sample-to-
sample fluctuations is quantified in Fig. 3, where their
magnitude can be seen to saturate to a limiting value
with increasing $s$. Furthermore the limiting value de-
pends on the ratio $n/N$: The smaller a subset of the
initial data set is used, the more precise the results can
be made by using large internal statistics. This behavior
underlines a particular strength of our approach, namely
that the combinatorially large number of subsequences
can be used (up to a point) to reduce fluctuations due
to the finite size of the data set. On the other hand, $n$
should not be chosen too small, as the amplitude of corre-
lations generally increases with $n$ [21] (see inset of Fig. 2).
For the examples presented here, we found $n/N = 1/4$
at $N = 64$ to yield the best compromise between these
two contradicting requirements, see also Fig. 3.

[Figure 3 plot. Main plot: curves for Lévy, Gauss and uniform; horizontal axis: internal statistics s (log10 scale). Inset: Gauss, uniform and Lévy for N = 32 and N = 64; horizontal axis: n/N.]

Figure 3. Magnitude of sample-to-sample fluctuations for
three of the cases considered in Fig. 2. With increasing in-
ternal statistics $s$, the sample-to-sample fluctuations decrease
to a limiting value (main plot). This limiting value increases
with $n/N$ (inset), indicating that best results in terms of fluc-
tuations are obtained by considering short subsequences. In
the main plot $N = 64$ and $n = 16$.

Application. As an application of our approach, we 
consider the ISI citation data set first analyzed by Red-
ner [6], consisting of citation data for 783339 papers pub-
lished in 1981 and cited between 1981 and June 1997.
Due to the large size of this data set, the existence of a
power law tail with exponent $\mu \approx 2$ is well established
0, [S^ . Using our record-based approach, the heavy- 
tailed property could be recovered by considering small, 
randomly chosen subsets of only $N = 64$ papers each
(Fig. 4). Despite the substantial fluctuations between
the three subsets, the HTI lies clearly above unity in all 
cases. The small size of the chosen subsets implies that 
only a few (if any) data points in the subsets come from 
the extreme tails of the distribution. The lower panel in 
Fig. 4 illustrates the robustness of the test with respect
to the removal of putative outliers. 

Summary. In conclusion, in this Letter we propose
a record-based distribution-free test for heavy tails that
works particularly well for small data sets. It was shown
that the test is very versatile and quite robust to the
removal of outliers, thus complementing standard meth-
ods like maximum likelihood estimates [7]. While record
statistics has a long history of yielding distribution-free
tests [11-13], our approach is conceptually novel in that
we make systematic use of the combinatorial proliferation
of subsets of the original data set, which are then manip-
ulated by adding a linear drift. We expect our method to
be particularly useful in situations where the size of the
data set is intrinsically limited, as in the assignment of
an EVT universality class to the distribution of beneficial
mutations in population genetics [9, 10]. In particular,
the test can be used to strengthen the evidence in fa-
vor of heavy-tailed behavior in situations where conven-
tional parametric tests have insufficient statistical power.
By combining our test with standard approaches such as
the maximum likelihood method, the tail parameters can
then also be estimated.

Figure 4. Top: Three randomly chosen subsets of length
$N = 64$ each from the ISI citation data set [6]. The HTI was
computed with internal statistics s = 10^ and $n = 16$. The
main plot shows attractive correlations in all three cases, the
inset verifies growth of these correlations with $n$. Bottom:
Removing the largest and even the top two entries of data set
2 does not change the result of the test. In data set 3, which is
a somewhat extreme case in that the largest value is more than
a factor 10 greater than the second largest, the correlations
remain attractive upon removal of the largest entry but the
magnitude of correlations no longer increases with $n$.